Introduction


Cancer results from abnormalities in critical genes that regulate normal cellular
growth and development. These abnormalities arise in two classes of interacting genes:
those that facilitate cell growth and tumor formation, whose overexpression induces
cancer, and those that inhibit these processes, whose loss or mutation causes cancer.
Genes in the second category are called tumor suppressor genes. Tumor suppressor
genes can be inactivated by two different mechanisms. One mechanism involves
mutation of the coding sequences within the gene. The other mechanism, which is
increasingly appreciated, is inactivation of the tumor suppressor gene by promoter
methylation. Promoters are DNA sequences found close to the start of a gene; they
often undergo changes in their molecular structure (methylation) that prevent
effective transcription of the gene. This project aims to develop software tools to
identify these regions, which are rich in CpG sites.

Body 

Our first task was to develop software for the identification of CpG islands in
genomic DNA sequences. When this project started, no available software could
perform this function precisely, so we set out to develop software exclusively for
this purpose. We chose a simple and efficient platform, Microsoft Excel, which is
commonly available to most academic scientists. This spreadsheet-based program is
versatile for handling large DNA sequences and can be easily programmed using
Visual Basic for Applications to develop software for custom use. We initially
developed a custom Excel spreadsheet into which a sequence to be analyzed could be
copied and pasted from public domain databases (Anbazhagan et al., 1991). Upon
execution of the program, a customized workbook analyzes an entered DNA sequence
for the total number and percentage of cytosine and guanine nucleotides, the total
number and percentage of CpG sites, and a CpG:GpC ratio. The program also displays
the distribution of CpG sites in a visual format as well as in two different
graphical formats. Finally, the program assists in laboratory studies of DNA
methylation that employ bisulfite modification of DNA by displaying
methylation-dependent effects of bisulfite treatment on DNA sequences.

While this program is very useful for the analysis of individual sequences, it
could not handle the large numbers of sequences needed for large-scale analysis. We
therefore modified the program so that thousands of sequences can be analyzed at the
same time. Using this modified program, we analyzed 40,000 clones from Research
Genetics and identified 162 clones that showed features consistent with CpG islands.
By employing restriction-sensitive comparative genomic hybridization, we identified
7 clones that are probably differentially methylated in breast cancers. These clones
are being evaluated for their usefulness as diagnostic markers for breast cancer.

Key  Research  Accomplishments 

Developed software tools to analyze genomic DNA and cDNA and identify specific
regions that are rich in CG content. Using this program, identified 162 clones that
show features suggestive of potential sites for CpG methylation. Employing
high-throughput technology for the identification of differentially methylated
clones in breast cancer, selected 7 clones that show differential methylation in
breast cancer.

Reportable  Outcomes 

A manuscript describing the software developed for the analysis of CpG islands was
published in BioTechniques (Anbazhagan et al., 1991).

Conclusions 

While  the  software  developed  for  the  identification  of  CpG  islands  will  be  very 
useful  in  future  work  in  this  area,  further  evaluation  of  the  differentially  methylated 
clones  would  help  to  develop  new  breast  cancer  markers  for  diagnostic  use. 

ABSTRACT

Methylation of DNA in CpG-dense regions of gene promoters (CpG islands) is
important for transcriptional inactivation of selective genes in normal and
neoplastic cells. Here, we present a spreadsheet-based program adapted from
Microsoft® Excel® that is useful for identifying CpG islands and for assisting in
the laboratory analysis of DNA methylation of these regions. Upon execution of the
program, a customized workbook analyzes an entered DNA sequence for the total
number and percentage of cytosine and guanine nucleotides, the total number and
percentage of CpG sites, and a CpG:GpC ratio. The program also displays the
distribution of CpG sites in a visual format as well as in two different graphical
formats. Finally, the program assists in laboratory studies of DNA methylation that
employ bisulfite modification of DNA by displaying methylation-dependent effects of
bisulfite treatment on DNA sequences.


INTRODUCTION 

While the CpG dinucleotide occurs at less than 10% of its expected frequency in
the human genome, many gene promoter regions have high densities of CpG sites.
These regions are known as CpG islands, and methylation of cytosines in CpG islands
is a common mechanism for transcriptional inactivation of genes on the inactivated
X-chromosome in females and of silenced imprinted alleles on autosomal chromosomes.
Furthermore, methylation is increasingly recognized as an important mechanism for
inactivation of tumor suppressor genes in neoplasia (reviewed by Baylin et al. in
Reference 2).

CpG islands are defined as regions of DNA ranging in size from 0.5 to 2 kb that
have a C + G content of greater than 60% and a ratio of CpG to GpC of at least 0.6
(3,4). Identification of CpG islands is a critical initial step for studying
transcriptional regulation of genes by methylation. The status of CpG islands can
then be evaluated in the laboratory using restriction enzymes that are methylation
specific, by sequencing bisulfite-modified DNA, or by methylation-specific PCR
(MSP) (5) of bisulfite-modified DNA. Here, we report the adaptation of a commonly
available spreadsheet program, Microsoft® Excel® 2000, as a tool for identifying
CpG islands and assisting in the analysis of DNA methylation.
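
The definition above can be expressed as a short check. The following Python sketch is illustrative only and is not the Excel program described in this paper; the function name `meets_island_criteria` and the handling of a fragment with no GpC sites are assumptions made for the example.

```python
def meets_island_criteria(seq: str,
                          min_len: int = 500,
                          max_len: int = 2000,
                          min_gc: float = 0.60,
                          min_ratio: float = 0.6) -> bool:
    """Return True if seq satisfies the three criteria quoted in the text:
    size 0.5-2 kb, C+G content > 60%, and CpG:GpC ratio >= 0.6."""
    s = seq.upper()
    if not (min_len <= len(s) <= max_len):
        return False

    # C + G content as a fraction of all nucleotides
    gc_fraction = (s.count("C") + s.count("G")) / len(s)

    # "CG" and "GC" cannot overlap with themselves, so str.count is exact
    n_cpg = s.count("CG")
    n_gpc = s.count("GC")

    # Assumption: with no GpC sites the ratio is treated as unbounded
    # when at least one CpG exists, and as zero otherwise.
    if n_gpc == 0:
        ratio = float("inf") if n_cpg else 0.0
    else:
        ratio = n_cpg / n_gpc

    return gc_fraction > min_gc and ratio >= min_ratio
```

A 600-nucleotide run of alternating C and G passes the check, while an AT-rich fragment of the same length, or a CpG-rich fragment shorter than 0.5 kb, does not.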


MATERIALS  AND  METHODS 

Downloading and Opening the Program

We have created an Excel file with a built-in program that performs all major
functions required for the analysis of CpG islands. The program works with Excel
2000. The program file (CpG-Win) can be downloaded from the Software Library on the
Internet at http://128.20.85.49/genomics. After downloading the file, the program
can be started in two ways: (i) open the downloaded file directly or (ii) start the
Excel program and open the downloaded file from within the program using File/Open.
A dialogue box appears, asking whether to enable or disable macros. Click on the
Enable Macros button. When the program opens, you will notice that a new menu named
Analyze-CpG is added to the program menu bar. This menu has seven menu items,
namely New-Workbook, Format, Count-CpG, Mark-CpG, Bis-Modify, and Make-Graph. These
menu items can be executed to perform various functions necessary for the analysis
of CpG islands within the DNA sequences.

Opening a New Workbook and Entering a Sequence

The Excel workbook is organized as a collection of sheets, with each sheet's name
appearing on a tab at the bottom of the workbook. Navigation between sheets is
possible by clicking on the tab at the bottom. Refer to the Microsoft User's Guide
for further explanation about the organization and use of Excel workbooks and
sheets. To use this program, go to the newly added Analyze-CpG menu and click on
the New-Workbook menu item. This command will open a new workbook with four sheets.
The program automatically formats Sheet2 as shown in Figure 1. To analyze a
specific DNA sequence, enter the nucleotide sequence in cell A1 of Sheet1.
Nucleotides in the sequence may be entered in either uppercase or lowercase.
Ambiguous nucleotides may be designated as “n”. The cells can hold more than 30,000
characters and can therefore accommodate long sequences. These sequences may be
copied and pasted from other documents. GenBank® sequences viewed through internet
browsers can also be directly copied and pasted along with numbers and spaces. If
you copy and paste sequences from GenBank, you will notice that the sequence
occupies multiple rows in Sheet1. After entering or pasting the sequence in cell A1
of Sheet1, go to the Analyze-CpG menu and click on the Format menu item. This
command removes all the numbers and spaces from the sequence (if there are any),
concatenates the sequences in all the rows, and transfers the sequence from cell A1
of Sheet1 to cell J2 of Sheet2. The command also activates Sheet2 so that you can
see the transferred sequence there.
The program automatically fills cell J3 of Sheet2 with a formula that counts and
displays the total number of nucleotides in cell J2 of Sheet2. When a sequence is
transferred from Sheet1 to cell J2 of Sheet2, the nucleotide count of that sequence
is automatically displayed in cell J3. The nucleotide count in cell J3 is also
updated automatically whenever any sequence entry is altered in cell J2.
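
For readers who want to reproduce the Format step outside Excel, the clean-up it performs (dropping numbers and spaces, then joining the rows) can be sketched in a few lines of Python. This is a hedged illustration, not the program's own macro code; the function name `format_sequence` and the sample GenBank-style rows are invented for the example.

```python
import re


def format_sequence(rows):
    """Join pasted rows and strip digits and whitespace, mirroring the
    behaviour described for the Format menu item."""
    joined = "".join(rows)
    return re.sub(r"[\d\s]", "", joined)


# Hypothetical rows as they might look after pasting from a GenBank page
rows = [
    "        1 acgtcgac ttgacc",
    "       61 nnacg",
]
sequence = format_sequence(rows)
nucleotide_count = len(sequence)  # the value cell J3 is described as showing
```

Case is left unchanged here because the text states that sequences may be entered in either uppercase or lowercase.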

Analyzing the CpG and GpC Distribution in Different Regions of the Sequence

The next step is to designate which portions of the sequence are to be analyzed for
CpG islands. This is done by entering the starting and ending position numbers of
the fragments to be analyzed in columns G and H, respectively, below the labels
Start and End. Up to six fragments can be entered in these columns with the
starting and ending position numbers of their nucleotides. After entering these
values, go to the Analyze-CpG menu and click on the Count-CpG menu item. After a
delay of a few seconds (depending on the length of the sequence), the analyzed data
are displayed in columns J-X as shown in Figure 2.

Column J (labeled “Length”) displays the length (number of nucleotides) of each of
the fragments entered in columns G and H. Column L (labeled “Number of C+G”)
displays the total number of cytosine and guanine nucleotides within each of the
fragments. The percentage of cytosine and guanine nucleotides within each fragment
is displayed in column N (labeled “(C+G)%”). Column P (labeled “Number of CpG”)
displays the total number of CpG sites within each of these fragments. The
percentage of CpG sites within each of these fragments is displayed in column R
(labeled “CpG%”). Similarly, the total number and the percentage of GpC base pairs
are displayed in columns T and V (labeled “Number of GpC” and “GpC%”,
respectively). The last column, X, shows the CpG:GpC ratio in the sequence.
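
The quantities reported in columns J-X can be sketched as follows. This Python example is a minimal illustration under stated assumptions, not the program's implementation: positions are taken as 1-based and inclusive, the CpG and GpC percentages are assumed to be the dinucleotide count divided by fragment length, and the names `FragmentStats` and `fragment_stats` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FragmentStats:
    length: int                      # column J, "Length"
    n_c_plus_g: int                  # column L, "Number of C+G"
    c_plus_g_percent: float          # column N, "(C+G)%"
    n_cpg: int                       # column P, "Number of CpG"
    cpg_percent: float               # column R, "CpG%"
    n_gpc: int                       # column T, "Number of GpC"
    gpc_percent: float               # column V, "GpC%"
    cpg_gpc_ratio: Optional[float]   # column X; None when there is no GpC


def fragment_stats(seq: str, start: int, end: int) -> FragmentStats:
    """Compute the statistics for the fragment start..end (1-based, inclusive)."""
    frag = seq[start - 1:end].upper()
    length = len(frag)
    n_c_plus_g = frag.count("C") + frag.count("G")
    n_cpg = frag.count("CG")
    n_gpc = frag.count("GC")

    def pct(n: int) -> float:
        # Assumption: percentages are relative to the fragment length
        return 100.0 * n / length if length else 0.0

    ratio = n_cpg / n_gpc if n_gpc else None
    return FragmentStats(length, n_c_plus_g, pct(n_c_plus_g),
                         n_cpg, pct(n_cpg), n_gpc, pct(n_gpc), ratio)
```

Calling `fragment_stats` once per Start/End pair reproduces one output row per fragment, up to the six fragments the worksheet accepts.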

Marking the CpG and GpC Base Pairs for Visual Display

Although the various parameters of the CpG islands are analyzed as described above,
this analysis does not give a visual impression of the distribution of the CpG and
GpC sites within a sequence. The next menu item, Mark-CpG, is meant to perform this
function. When this menu item is executed, the program takes the sequence in cell
J2 of Sheet2 and displays it in columns B and D (labeled “CpG Marked” and “GpC
Marked”, respectively) of Sheet3 in a vertical column format with each nucleotide
occupying a single cell. The program also highlights the CpG sites in column B and
GpC sites in column D with a bright pink background. This highlighting helps to
visually appreciate the distribution of the CpG and GpC sites (Figure 3). This
visual effect can be better appreciated by clicking on the View menu and executing
the Zoom function, which are built-in Excel commands. By adjusting the zoom ratio
to 10% of the original, an overall view of the marked sequence can be seen.
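
The marking step can be imitated in plain text. The sketch below is an illustration only: it assumes that both nucleotides of each CpG (or GpC) site are highlighted, and it uses uppercase letters as a stand-in for the pink cell background. The names `mark_dinucleotide` and `render_marked` are hypothetical.

```python
def mark_dinucleotide(seq: str, pair: str = "CG"):
    """Return one boolean per nucleotide: True where the nucleotide
    belongs to an occurrence of `pair` (assumed highlighting rule)."""
    s = seq.upper()
    marked = [False] * len(s)
    for i in range(len(s) - 1):
        if s[i:i + 2] == pair:
            marked[i] = True
            marked[i + 1] = True
    return marked


def render_marked(seq: str, pair: str = "CG") -> str:
    """Show marked nucleotides in uppercase and all others in lowercase."""
    flags = mark_dinucleotide(seq, pair)
    return "".join(ch.upper() if flag else ch.lower()
                   for ch, flag in zip(seq, flags))
```

Printing one character of `render_marked` per line gives a rough text equivalent of the vertical column display.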

Displaying the Bisulfite-Modified DNA Sequence

MSP and sequencing of bisulfite-modified DNA are commonly employed to analyze
methylation of DNA. These techniques both involve treating the DNA with bisulfite,
which chemically modifies the unmethylated cytosine residues to uracil. In effect,
this modification is reflected as thymine after sequencing or MSP. Bisulfite
treatment of DNA therefore results in different products, depending on the
methylation status of the DNA before modification.

Figure 1. Sheet2 of a new workbook opened by executing the menu item New-Workbook
in the Analyze-CpG menu.

The next command in our custom menu, Bis-Modify, helps to identify possible
nucleotide sequences after bisulfite treatment, depending on prior methylation
status. When this menu item is executed, the DNA sequence from cell J2 of Sheet2 is
displayed in columns G, H, and I of Sheet3 (labeled “Original”, “Unmethylated” and
“Methylated”, respectively), in a vertical format, with each nucleotide occupying a
single cell. The column labeled “Original” displays the original sequence for the
purpose of comparison with the others. The column labeled “Unmethylated” displays
the modified sequence that would result from bisulfite treatment of unmethylated
DNA. Note that in this situation, all cytosine nucleotides would be converted to
thymine regardless of whether or not they are associated with a part of a CpG
island. These modified nucleotides are highlighted with a bright pink background.
The next column, labeled “Methylated”, displays the sequence that would result from
bisulfite treatment of DNA with methylated sequences in CpG islands. In this
situation, all cytosines not followed by guanine (thus not making a CpG pair) are
modified to thymine, while those followed by guanine (thus making a CpG pair) are
retained as cytosines. As in the previous case, the modified nucleotides are
highlighted with a bright pink background. Thus, the execution of this menu item
displays all three sequences (original, unmethylated, and methylated) side by side
with the modified nucleotides being highlighted. This display format facilitates
comparison of the sequences and identification of possible sites of methylation.
The results of bisulfite modification of the complementary strand are displayed in
columns K, L, and M.

Figure 3. A view of Sheet3 displaying results of executing Mark-CpG and Bis-Modify
menu items.

To design primers for MSP, it may be desirable to copy these unmodified and
modified sequences to another file or to another program used for designing
primers. To make it easy to copy these sequences, they are also displayed in a
horizontal format with each sequence occupying a single cell as shown in Figure 3.
The menu item Bis-Modify performs this function as well. The unmethylated or
methylated version of the DNA sequence can be easily copied from column O.
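
The two conversion rules described for Bis-Modify are simple enough to state as code. The following Python sketch illustrates those rules and is not the program's macro; the function names are hypothetical, and treating the complementary strand by applying the same rules to the reverse complement is an assumption, since the text does not say how columns K, L, and M are computed.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def bisulfite_unmethylated(seq: str) -> str:
    """Unmethylated DNA: every cytosine is read as thymine."""
    return seq.upper().replace("C", "T")


def bisulfite_methylated(seq: str) -> str:
    """DNA methylated at CpG sites: a cytosine followed by guanine is
    retained; every other cytosine is read as thymine."""
    s = seq.upper()
    out = []
    for i, base in enumerate(s):
        if base == "C" and s[i + 1:i + 2] != "G":
            out.append("T")
        else:
            out.append(base)
    return "".join(out)


def reverse_complement(seq: str) -> str:
    """Assumed helper for the complementary strand, written 5' to 3'."""
    return seq.upper().translate(_COMPLEMENT)[::-1]
```

For the sequence ACGTCCGA, the unmethylated product is ATGTTTGA and the methylated product is ACGTTCGA; positions where the two differ are the CpG cytosines that an MSP primer pair would need to distinguish.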


Displaying the CpG and GpC Base Pairs in Graphical and Pseudographical Formats

The next menu item, Make-Graph, is used to display the distribution of CpG and GpC
base pairs in graphical formats. When this command is executed, the program takes
the sequence in cell J2 of Sheet2, analyzes the CpG distribution, and plots two
graphs. First, the program scans the sequence through a window of 10 bp, computes
the CpG frequency, makes a line graph, and displays it in Sheet4 as shown in
Figure 4. The Y-axis in this graph represents the exact number of CpG base pairs
counted within a window of 10 bp. Then, the program scans the sequence again
through a window of 100 bp, computes the CpG frequency, and displays the graph in
Sheet4 just below the earlier graph. The Y-axis in this graph represents the
percentage of CpG frequency and has a fixed range of 0%-12%. While the first graph
will be useful for the analysis of shorter fragments, the second will be very
useful for analyzing larger fragments.
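
The window scans behind the two graphs can be sketched as below. This is an illustration under assumptions the text leaves open: the window is assumed to advance one nucleotide at a time (the `step` parameter), and the percentage is assumed to be the CpG count divided by the window size. The function names are hypothetical.

```python
def cpg_window_counts(seq: str, window: int = 10, step: int = 1):
    """Number of CpG sites in each window of `window` nucleotides."""
    s = seq.upper()
    return [s[i:i + window].count("CG")
            for i in range(0, len(s) - window + 1, step)]


def cpg_window_percent(seq: str, window: int = 100, step: int = 1):
    """CpG count per window expressed as a percentage of the window size
    (assumed definition of the percentage plotted on the second graph)."""
    return [100.0 * n / window
            for n in cpg_window_counts(seq, window, step)]
```

The returned lists hold the Y values of the two line graphs; a sequence shorter than the window yields an empty list.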

The program also makes a text string “pseudograph” for CpG and GpC base pairs. In
this pseudograph, the CpG or GpC pairs are represented by the “I” symbol, and all
other possible pairs of nucleotides are represented by another symbol. For example,
a sequence such as “acgtcgac” is displayed as a string of these symbols to
represent the distribution of the CpG pairs. These pseudographs are displayed just
above the previous graph with labels “CpG base pairs” and “GpC base pairs”. This
gives another visual tool to appreciate the distribution of the CpG islands in a
sequence. The sequence and the results can be saved using the built-in Save As
command in Excel. The user can then open a new workbook and continue analyzing new
sequences, if necessary.
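
A pseudograph of this kind can be generated with a one-line scan over adjacent nucleotide pairs. In the Python sketch below, the “I” symbol for a matching pair follows the text, while the “.” used for every other pair is a placeholder chosen for this example; the function name `pseudograph` is likewise hypothetical.

```python
def pseudograph(seq: str, pair: str = "CG",
                hit: str = "I", miss: str = ".") -> str:
    """One symbol per adjacent nucleotide pair: `hit` where the pair
    equals `pair`, `miss` (an assumed placeholder) everywhere else."""
    s = seq.upper()
    return "".join(hit if s[i:i + 2] == pair else miss
                   for i in range(len(s) - 1))


cpg_line = pseudograph("acgtcgac", "CG")   # CpG base pairs
gpc_line = pseudograph("acgtcgac", "GC")   # GpC base pairs
```

Because a sequence of n nucleotides has n - 1 adjacent pairs, the eight-nucleotide example produces a seven-symbol string.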


DISCUSSION 

One important feature that we have not included in this program is the ability to
search for restriction sites, especially for methylation-specific enzymes, within
the DNA sequence. We did not include this feature in the current program because
our earlier programs, Win-Align and Mac-Align, can perform these functions
effectively (1).

In summary, we have described a spreadsheet-based program for the analysis of CpG
islands. This program has features to copy and paste sequences from the GenBank
database and would be very useful for identifying CpG frequency in different
regions of DNA. The user can get a numerical output of the CpG frequency in
selected regions and appreciate the output visually through a color-shaded vertical
display and through graphical and pseudographical formats. The program also outputs
the sequence that is expected after bisulfite treatment of DNA. This feature will
be especially useful for designing primers for methylation-specific PCR.